Separated text line in Apache POI XWPFRun object
A

5

13

I'm trying to fill in a template DOCX document with Apache POI by using the XWPFDocument class. I have tags in the document and a JSON file that holds the replacement data. My problem is that a line of text is sometimes split up inside the DOCX, which I can see when I change the file extension to ZIP and open document.xml. For example, the text [MEMBER_CONTACT_INFO] is stored as two separate pieces, [MEMBER_CONTACT_INFO and ]. POI reads it the same way, since that is how the original DOCX stores it. The paragraph therefore ends up with 2 XWPFRun objects whose texts are [MEMBER_CONTACT_INFO and ].

My question is: is there a way to force POI to behave like Word, by merging related runs or something like that? Or how else can I solve this problem? I'm matching run texts while replacing, and I can't find my tag because it is split across 2 different run objects.

Best

Akilahakili answered 2/10, 2013 at 13:16 Comment(2)
I also think it occurs if the tag is in a table!Akilahakili
Does nobody have a tip about this?Akilahakili
I
19

This wasted so much of my time once...

Basically, an XWPFParagraph is composed of multiple XWPFRuns, and an XWPFRun is a contiguous piece of text that shares one fixed style.

So when you type something like "[PLACEHOLDER_NAME]" in MS Word, it will create a single XWPFRun. But if you add a few more things and then go back and change "[PLACEHOLDER_NAME]" to something else, there is no guarantee that it will remain a single XWPFRun; it is quite possible that it will be split into two runs. AFAIK this is how MS Word works.
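To see why matching run by run fails, it helps to treat the paragraph as the concatenation of its run texts and ask which runs a tag touches. The following is only a minimal sketch: plain strings stand in for the runs so that it runs without POI on the classpath (in real code each string would presumably come from XWPFRun.getText(0)), and the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical helper: each string stands in for the text of one XWPFRun.
class RunSpanFinder {

    /**
     * Returns {firstRun, lastRun} (inclusive indexes) of the runs touched by
     * the first occurrence of tag, or null if the tag does not occur.
     */
    static int[] findSpan(List<String> runTexts, String tag) {
        if (tag.isEmpty()) return null;
        String joined = String.join("", runTexts);
        int start = joined.indexOf(tag);
        if (start < 0) return null;
        int end = start + tag.length(); // exclusive
        int first = -1, last = -1, pos = 0;
        for (int i = 0; i < runTexts.size(); i++) {
            int runStart = pos;
            int runEnd = pos + runTexts.get(i).length();
            // Overlap test of this run's character range against [start, end).
            if (runEnd > start && runStart < end) {
                if (first < 0) first = i;
                last = i;
            }
            pos = runEnd;
        }
        return new int[] {first, last};
    }
}
```

With the run texts "Dear ", "[MEMBER_CONTACT_INFO", "]" and " thanks", searching for "[MEMBER_CONTACT_INFO]" reports runs 1 to 2, which is exactly the split described in the question.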

How to avoid splitting of Runs in such cases?

There are two solutions that I know of:

  1. Copy the text "[PLACEHOLDER_NAME]" to Notepad or a similar plain-text editor. Make the necessary modification there, then copy it back and paste it over "[PLACEHOLDER_NAME]" in your Word file. This way the whole "[PLACEHOLDER_NAME]" is replaced with the new text in one go, which avoids splitting it into several XWPFRuns.

  2. Select "[PLACEHOLDER_NAME]", then use MS Word's "Replace" option and replace it with "[Your-new-edited-placeholder]". This should ensure that your new placeholder ends up in a single XWPFRun.

If you have to change your new placeholder again, follow step 1 or 2.

Infare answered 8/7, 2015 at 7:56 Comment(1)
One more solution is to edit document.xml inside the docx archive so that "[PLACEHOLDER_NAME]" sits in a single <w:t>...</w:t> element.Olericulture
B
2

Here is the Java code to fix that separated text line issue. It also handles replacement of a string that is split across runs with different formatting.

import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public static void replaceString(XWPFDocument doc, String search, String replace) {
  // Body paragraphs
  for (XWPFParagraph p : doc.getParagraphs()) {
    replaceInParagraph(p, search, replace);
  }
  // Paragraphs inside table cells get the same treatment
  for (XWPFTable tbl : doc.getTables()) {
    for (XWPFTableRow row : tbl.getRows()) {
      for (XWPFTableCell cell : row.getTableCells()) {
        for (XWPFParagraph p : cell.getParagraphs()) {
          replaceInParagraph(p, search, replace);
        }
      }
    }
  }
}

private static void replaceInParagraph(XWPFParagraph p, String search, String replace) {
  List<XWPFRun> runs = p.getRuns();
  List<Integer> group = new ArrayList<>(); // indexes of consecutive runs that spell out the tag
  String groupText = search;               // part of the tag that is still unmatched
  for (int i = 0; i < runs.size(); i++) {
    XWPFRun r = runs.get(i);
    String text = r.getText(0);
    if (text == null) {
      continue;
    }
    if (text.contains(search)) {
      // The whole tag sits in one run: plain literal replacement
      r.setText(text.replace(search, replace), 0);
    } else if (groupText.startsWith(text)) {
      group.add(i);
      groupText = groupText.substring(text.length());
      if (groupText.isEmpty()) {
        // Tag complete: put the replacement into the first run of the group
        runs.get(group.get(0)).setText(replace, 0);
        // Remove the other runs back to front so the remaining indexes stay valid
        for (int j = group.size() - 1; j >= 1; j--) {
          p.removeRun(group.get(j));
        }
        i = group.get(0); // continue right after the run that now holds the replacement
        group.clear();
        groupText = search;
      }
    } else {
      group.clear();
      groupText = search;
    }
  }
}

Boru answered 19/7, 2017 at 14:53 Comment(4)
This helped me. The bit with removeRun doesn't work for me because the indexes change as you delete. Replacing that line with p.removeRun(1) fixes this, and it works a treat.Ashly
I was too hasty and didn't test my change with enough data. Instead, simply replace the for loop wrapping removeRun with one that goes in reverse, i.e. for(int j=group.size()-1; j>=1; j--), and it works for me.Ashly
@Ashly I have a similar issue. Can you, or anyone else here, help? Here is the link to the question: https://mcmap.net/q/905019/-how-to-change-content-inside-table-nested-table-in-docx-file-using-apache-poi/13267143Gayton
I'm running into this same issue and taking this code as inspiration. I see two problems with this implementation: 1) the first run could start with some other text and only then contain the beginning of the search pattern; 2) the last run could contain text that should not be removed, or could contain the start of a new search pattern. Otherwise, thank you for publishing this code; it gave me some ideas about how to approach the solution.Letter
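The two problems raised in the last comment (text before the tag in the first run, text after it in the last run) could be handled by working with character offsets in the joined paragraph text rather than with whole runs. The following is a rough sketch of that idea, not the answer author's method: plain strings stand in for the run texts so that it runs without POI, and the class and method names are hypothetical. In POI, each resulting string would presumably be written back with setText(text, 0) on the matching run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each string stands in for the text of one XWPFRun.
class SplitTagReplacer {

    /**
     * Replaces the first occurrence of tag, even if it spans several runs.
     * Text before the tag in the first run and after it in the last run is kept.
     * Returns a new list with the same number of entries as the input.
     */
    static List<String> replaceFirst(List<String> runTexts, String tag, String replacement) {
        List<String> out = new ArrayList<>(runTexts);
        if (tag.isEmpty()) return out;
        String joined = String.join("", runTexts);
        int start = joined.indexOf(tag);
        if (start < 0) return out;
        int end = start + tag.length(); // exclusive
        boolean written = false;
        int pos = 0;
        for (int i = 0; i < runTexts.size(); i++) {
            String text = runTexts.get(i);
            int runStart = pos;
            int runEnd = pos + text.length();
            pos = runEnd;
            if (runEnd <= start || runStart >= end) continue; // run lies outside the tag
            // Part of this run in front of the tag, and part behind it.
            String before = text.substring(0, Math.max(0, start - runStart));
            String after = text.substring(Math.min(text.length(), Math.max(0, end - runStart)));
            // The replacement goes into the first run that the tag touches.
            out.set(i, before + (written ? "" : replacement) + after);
            written = true;
        }
        return out;
    }
}
```

For the run texts "Dear [MEM", "BER" and "] and more", replacing "[MEMBER]" with "Bob" yields "Dear Bob", "" and " and more". Emptied runs are kept rather than removed here, so no index ever shifts; whether to remove them afterwards is a separate choice.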
M
1

For me it didn't work as I expected (not every time). In my case I used "${PLACEHOLDER}" in the text. First we need to look at how Apache POI breaks each paragraph we want to iterate over into runs. If you dig into how a docx file is built, you will see that one run is a sequence of characters with the same font style, font size, colour, bold, italic and so on. Because of that, the placeholder was sometimes divided into parts, and sometimes the whole paragraph was recognized as one run, which made it impossible to iterate through individual words.
What I did was to make the placeholder name bold in the template document. Then, when iterating through the runs, I got the whole placeholder name ${PLACEHOLDER} as one run. I originally replaced the value with this loop:

for (XWPFRun r : p.getRuns()) {
  String text = r.getText(0);
  if (text != null && text.contains("originalText")) {
    text = text.replace("originalText", "newText");
    r.setText(text, 0);
  }
}

I then just added r.setBold(false); after setText.
That way the placeholder is recognized as a separate run, so I can replace that specific placeholder, and the processed document has no bolding, just plain text.
An additional advantage for me was that I can visually find placeholders in the text faster. So finally the loop above looks like this:

for (XWPFRun r : p.getRuns()) {
  String text = r.getText(0);
  if (text != null && text.contains("originalText")) {
    text = text.replace("originalText", "newText");
    r.setText(text, 0);
    // Clear the bold formatting that was only used to isolate the placeholder
    r.setBold(false);
  }
}

I hope it will help someone, as I spent too much time on this :)

Mordacious answered 4/2, 2020 at 7:34 Comment(1)
G
1

To be sure that a word will be treated as a single XWPFRun, you can use a MergeField as the variable in Word, like this:

  1. Place cursor on the word you want to be a single run.
  2. Press CTRL and F9 together and { } in gray will appear.
  3. Right-click on the { } field and select Edit Field.
  4. In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
  5. Click OK.
Gunfire answered 17/8, 2022 at 9:48 Comment(6)
I used your method, but I don't understand why my MergeField is surrounded by «». What could be the reason?Helbonnah
@ThePrototype The «» are used to mark merge fields in Microsoft Word. You can press ALT + F9 to toggle field codes on/off.Gunfire
@Yehound, thanks, but then it looks something like this: { MERGEFIELD ${placeholder} }Helbonnah
When you replace the merge fields with a value using Apache POI, the «» should disappear.Gunfire
@Yebouda Unfortunately this is not happening #76322717Helbonnah
@ThePrototype My merge fields look like this: { MERGE FIELD placeholder} (without the $ and the other pair of braces), so maybe that is the reason. To isolate the variable I use _ before and after, like this: { MERGE FIELD _placeholder _}Gunfire
W
0

I also had this issue a few days ago and couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This is working fine for me, and it is read as a single XWPFRun object.

Woodcock answered 6/12, 2013 at 15:5 Comment(0)
